KV cache
KV cache trades space for time
Without KV cache: when generating token N, we need to
- recompute the Key and Value matrices for all previous tokens (1 through N-1)
- compute attention using these Keys and Values
With KV cache:
- store the Key and Value vectors for each token after computing them once
- when generating new tokens, reuse the cached K,V vectors from previous tokens
- only compute new K,V vectors for the current token being generated
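The cached decode step above can be sketched for a single attention head. This is a toy NumPy illustration, not any particular model's implementation; the projection matrices and head dimension are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (assumption)

# Hypothetical projection matrices; in a real model these come from trained weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

cache_k, cache_v = [], []  # the KV cache: one K and one V vector per past token

def decode_step(x):
    """Attend the new token's query over cached K,V plus its own K,V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache_k.append(k)           # compute K,V once for this token...
    cache_v.append(v)           # ...and reuse them on every later step
    K = np.stack(cache_k)       # (t, d): past K,V are read, never recomputed
    V = np.stack(cache_v)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V          # attention output for the current token

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```

Each step does O(t) work against the cache instead of recomputing K,V for all t previous tokens, which is exactly the space-for-time trade.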
Optimization techniques:
- MQA/GQA (Multi-Query / Grouped-Query Attention): reduce the KV cache by sharing K,V across heads
- MLA (Multi-head Latent Attention): compresses K,V into low-rank latent representations
- Sliding window: only cache the most recent tokens
- KV cache quantization: store K,V in lower precision (e.g., INT8)
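The savings from these techniques follow directly from the KV cache size formula (2x for Keys and Values, times layers, KV heads, head dimension, sequence length, and bytes per element). A small sketch, using a made-up Llama-2-7B-like configuration (32 layers, head dimension 128) purely as an assumption for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes, batch=1):
    # 2x accounts for storing both Keys and Values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Hypothetical config: 32 layers, head_dim 128, 4096-token context, fp16 (2 bytes)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, dtype_bytes=2)  # full multi-head attention
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     seq_len=4096, dtype_bytes=2)  # GQA with 8 KV heads

print(mha / 2**30, "GiB vs", gqa / 2**30, "GiB")  # GQA: 4x smaller cache
```

Going from 32 to 8 KV heads cuts the cache by 4x, and quantizing fp16 to INT8 would halve it again; the levers multiply.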
Backlinks
- Attention Mechanism Variants: "compress Keys and Values into a low-rank latent space, reducing KV cache requirements"